Biostatistics For Dummies, 2nd Edition (Monika Wahi, John Pezzullo)

CHAPTER 17 More of a Good Thing: Multiple Regression

a binary Diabetes variable, where 35 were coded as yes and the rest were no, and

so on for Cancer and Other.

Similarly, if your model has two categorical variables with an interaction term (like

Setting + Primary Diagnosis + Setting * Primary Diagnosis), you should prepare a

two-way cross-tabulation of the two variables first (in our example, Setting by

Primary Diagnosis). Your analysis is limited by the number of rows in each cell of the

table, so check that every cell has enough rows before you run it. See Chapter 12

for details about cross-tabulations.
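
As a minimal sketch of the check described above, the following Python counts the rows in each Setting by Primary Diagnosis cell and flags the thin ones. The records, the Setting levels (Inpatient, Outpatient), the helper names, and the threshold of 2 are all made up for illustration; they are not from the example data set, and a statistical package would normally build this table for you.

```python
from collections import Counter

# Hypothetical records: one (Setting, PrimaryDx) pair per participant.
records = [
    ("Inpatient", "Hypertension"), ("Inpatient", "Diabetes"),
    ("Inpatient", "Hypertension"), ("Outpatient", "Hypertension"),
    ("Outpatient", "Cancer"), ("Outpatient", "Diabetes"),
    ("Outpatient", "Hypertension"), ("Inpatient", "Other"),
]


def cross_tab(pairs):
    """Count how many rows fall in each (row level, column level) cell."""
    return Counter(pairs)


def sparse_cells(table, row_levels, col_levels, min_count):
    """Return the cells with fewer than min_count rows, including empty cells."""
    return [
        (r, c)
        for r in row_levels
        for c in col_levels
        if table[(r, c)] < min_count
    ]


table = cross_tab(records)
settings = sorted({s for s, _ in records})
diagnoses = sorted({d for _, d in records})

# Print the two-way table, one Setting per line.
print("Columns:", diagnoses)
for s in settings:
    print(s, [table[(s, d)] for d in diagnoses])

# min_count=2 is an arbitrary illustrative threshold, not a rule from the text.
thin = sparse_cells(table, settings, diagnoses, min_count=2)
print("Cells with too few rows:", thin)
```

Listing every combination of levels, rather than only the combinations that appear in the data, is what makes empty cells visible.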

Choosing the reference level wisely

For each categorical variable in a multiple regression model, the program considers

one of the categories to be the reference level and evaluates how each of the

other levels affects the outcome, relative to that reference level. Statistical software

lets you specify the reference level for a categorical variable, but you can also

let the software choose it for you. The problem is that the software makes that

choice by an arbitrary rule (such as whichever level sorts first alphabetically),

and often picks one you don’t want. It is therefore better to tell the software

which reference level to use for every categorical variable. For

specific advice on choosing an appropriate reference level, read the next section,

“Recoding categorical variables as numerical.”
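
To make the idea of a reference level concrete, here is a small Python sketch that builds one 0/1 indicator column for every level except the reference. The function name dummy_code and the sample values are hypothetical, and picking Hypertension as the reference is only for illustration; real statistical software has its own syntax for this.

```python
def dummy_code(values, reference):
    """Return one 0/1 indicator column per non-reference level."""
    levels = sorted(set(values))
    if reference not in levels:
        raise ValueError(f"reference level {reference!r} not found in data")
    return {
        level: [1 if v == level else 0 for v in values]
        for level in levels
        if level != reference
    }


# Hypothetical PrimaryDx values for five participants.
primary_dx = ["Hypertension", "Diabetes", "Cancer", "Other", "Hypertension"]

# What an alphabetical default would choose as the reference level.
default_reference = sorted(set(primary_dx))[0]
print("Alphabetical default:", default_reference)

# Stating the reference explicitly (Hypertension is an illustrative choice).
chosen = dummy_code(primary_dx, reference="Hypertension")
for level, column in chosen.items():
    print(level, column)
```

The reference level gets no column of its own: a participant at the reference level has 0 in every indicator, which is why each coefficient is read relative to that level.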

Recoding categorical variables as numerical

Data may be stored as character variables, meaning the variable for primary

diagnosis (PrimaryDx) may contain character data, such as Hypertension, Diabetes,

Cancer, and Other. Because it is difficult for statistical programs to work with

character data, these variables are usually recoded with a numerical code before

being used in a regression. This means a new variable is created, and is coded as 1

for hypertension, 2 for diabetes, 3 for cancer, and so on.

It is best to code binary variables as 0 for not having the attribute or state, and 1

for having the attribute or state. So a binary variable named Cancer should be

coded as Cancer = 1 if the participant has cancer, and Cancer = 0 if they do not.
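
A short Python sketch of this 0/1 coding follows. The helper name to_binary and the yes/no values are made up for illustration. One practical benefit of the 0/1 convention is that the sum of the variable is the number of participants with the attribute, and its mean is the proportion.

```python
def to_binary(values, yes_value):
    """Code a two-level variable as 1 (has the attribute) or 0 (does not)."""
    return [1 if v == yes_value else 0 for v in values]


# Hypothetical character data for four participants.
cancer_text = ["yes", "no", "no", "yes"]
cancer = to_binary(cancer_text, yes_value="yes")

n_with_cancer = sum(cancer)                   # count of participants coded 1
prop_with_cancer = sum(cancer) / len(cancer)  # proportion coded 1

print(cancer, n_with_cancer, prop_with_cancer)
```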

For categorical variables with more than two levels, it’s more complicated. Even if

you recode the categorical variable from containing characters to a numeric code,

this code cannot be used in regression unless you want to model the category as an

ordinal variable. Imagine a variable coded as 1 = graduated high school, 2 = graduated

college, and 3 = obtained post-graduate degree. If this variable were entered

as a predictor in regression, the model would assume equal steps going from code 1 to code 2,

and from code 2 to code 3. Anyone who has applied to college or gone to graduate

school knows these steps are not equal! To solve this problem, you could select